Molnar et al. BMC Genomics 2014, 15:761 
http://www.biomedcentral.com/1471-2164/15/761






RESEARCH ARTICLE Open Access 



Genome sequencing and analysis of Mangalica, 
a fatty local pig of Hungary 

Janos Molnar 2,3 , Tibor Nagy 1 , Viktor Steger 1 , Gabor Toth 1,4 , Ferenc Marincs 1* and Endre Barta 1*



Abstract 

Background: Mangalicas are fatty type local/rare pig breeds with an increasing presence in the niche pork market 
in Hungary and in other countries. To explore their genetic resources, we have analysed data from next-generation 
sequencing of an individual male from each of three Mangalica breeds along with a local male Duroc pig. Structural 
variations, such as SNPs, INDELs and CNVs, were identified and particular genes with SNP variations were analysed 
with special emphasis on functions related to fat metabolism in pigs. 

Results: More than 60 Gb of sequence data were generated for each of the sequenced individuals, resulting in 
11x to 19x autosomal median coverage. After stringent filtering, around six million SNPs, of which approximately
10% are novel compared to the dbSNP138 database, were identified in each animal. Several hundred thousand
INDELs and about 1,000 CNV gains were also identified. The functional annotation of genes with exonic,
non-synonymous SNPs, which are common in all three Mangalicas but are absent in either the reference genome or the 
sequenced Duroc of this study, highlighted 52 genes in lipid metabolism processes. Further analysis revealed that 41 of 
these genes are associated with lipid metabolic or regulatory pathways, 49 are in fat-metabolism and fatness-phenotype 
QTLs and, with the exception of ACACA, ANKRD23, GM2A, KIT, MOGAT2, MTTP, FASN, SGMS1, SLC27A6 and RETSAT, have not 
previously been associated with fat-related phenotypes. 

Conclusions: Genome analysis of Mangalica breeds revealed that local/rare breeds could be a rich source of sequence 
variations not present in cosmopolitan/industrial breeds. The identified Mangalica variations may, therefore, be a very 
useful resource for future studies of agronomically important traits in pigs. 

Background

Due to the economic value of farm animals, their gen- 
omics, in general, and whole genome sequencing, in par- 
ticular, are important issues. Results of such research 
have already had an impact and will continue to do so in 
the future in terms of production of meat, milk, fibre 
and other products, environmental effects of animal 
husbandry, breeding, animal health, feeding, and even 
human medical issues such as xenotransplantation and 
disease modelling [1,2]. Accordingly, the genomes of a
number of agriculturally important animal species have
been or are being completed [3-11].

The pig is one of the most important farm animals, providing
about 103 million tonnes of pork for meat
consumption worldwide in 2012 [12]. Moreover, pigs can
be used as a model for human diseases, such as arthritis, 
cardiovascular diseases, diabetes and obesity, because pigs 
are more similar to humans at the physiological and gene level
than rodent animal models are [2]. According
to different sources, the predicted number of pig breeds
and lines ranges from 350 to 730 [13,14]. Most of these
breeds are local, with only 25 found in multiple regions of a 
country, and a further 33 spread to more than one country 
[13]. Despite the large number of pig breeds, only six
(Large White, Duroc, Landrace, Hampshire, Berkshire and 
Pietrain) dominate the pork industry [13]. 

In the last decade, enormous efforts have been made 
to exploit the genetic and genomic resources of pigs. 
Genome sequencing of swine goes back to the early 
2000s, when the Sino-Danish Pig Genome Project was 
initiated and subsequently a 0.66x coverage genome survey,
based on shotgun sequencing, was published [15].
Deeper coverage sequencing of the pig genome was initiated
by the Swine Genome Sequencing Consortium
[16]. The Sscrofa9 genome assembly was released in 
2009 [17] and the pig genome sequence was recently 
published [9]. These genome resources for pig, together 
with specialised sequencing projects such as parallel sequencing,
have greatly widened our knowledge
of the pig genome, including SNP identification
and genotyping [18-20], GC variance [21], muscle transcriptome
[22,23], pig interactome [24], domestication/selection
[25], evolution/domestication [9], and a number
of other recently published research topics [26].

Despite the large number of local pig breeds, only a 
few of them (for example Angler Sattelschwein, British
Saddleback, Cinta Senese, Manchado de Jabugo, Basque
and Guadyerbas) were included in genome sequencing
projects. In addition to the major industrial and the few 
local breeds, Asian and European wild boars, several 
Asian pig breeds and several other species of the Sus 
genus have also been included [9,27-29]. However, other 
local breeds, of which many are endangered, should also 
be of great interest for genomic studies because of their 
importance in biodiversity, conservation, local commu- 
nity and even pork production issues [14,30]. Mangalica 
is an example of a local/rare breed with a characteristic 
curly hair phenotype, which is indigenous to Hungary 
and was developed in the 19th century [14]. Mangalicas
are fatty-type pigs [31], with high intramuscular fat content
[32]. Mangalicas have three colour variants, Blond,
Red and Swallow-belly, which are considered as separate 
breeds based on microsatellite studies [33]. As the history
of the three Mangalica breeds indicates [14], the
Blond was bred first from old Hungarian pig races and pigs
of Mediterranean origin, and then it contributed to the 
two newer breeds, Red and Swallow-belly Mangalicas. 
Reproduction studies are quite numerous in Mangalica 
[34-38], but genetic studies are rare [39]. Previously we 
have described that the mtDNA D-loop sequences of Man- 
galicas display low diversity, but the maternal lineages that 
they represent are genetically distant from cosmopolitan 
breeds kept in Hungary [14] and very likely originate from 
one particular European ancient line [40]. 

In order to explore how the genomes of Mangalicas 
differ from the reference pig genome, we have sequenced 
a male individual of each of the three Mangalica breeds 
along with a male Duroc individual of Hungarian origin. 
The genome sequence of Mangalicas can serve as a basis 
for future conservation of the breeds and for an ex- 
tended Mangalica pork industry. 

Results 

Genome sequencing 

Three Mangalica male pigs with a Mangalica-specific 
mitochondrial D-loop haplotype were selected [40] for 



genome sequencing. These animals were kept at Emőd,
Hungary, and were registered at the Hungarian Mangalica genebank
as pedigree sires. They were previously assessed as
Blond, Red and Swallow-belly Mangalicas, respectively, 
under the Hungarian Mangalica Standard and by micro- 
satellite analysis. A Duroc male of Hungarian origin was 
also sequenced, because we have found previously that 
Duroc pigs of international or Hungarian origin belong 
to different maternal lineages [40] and Mangalica x
Duroc F1 hybrids are processed at industrial scale in
Hungary for pork products. 

Genome sequencing resulted in 6.27 × 10^8, 4.15 × 10^8,
4.06 × 10^8 and 3.32 × 10^8 reads for the genomes of the
Blond, Red and Swallow-belly Mangalica and the Duroc
animals, respectively (Table 1). Given the 500 bp average
fragment size of the libraries used for the 2 × 100 bp
paired-end sequencing, a 300 bp spacer between the
reads was predicted. Mapping of the reads to the reference
pig genome Sscrofa10.2 resulted in an excellent
correspondence between the expected and observed
length of the spacers (Additional file 1). The proportion
of the mapped reads was 77.3, 83.3, 82.8 and 82.5%,
resulting in 19x, 14x, 14x and 11x median autosomal
coverage, respectively, for the four sequenced individuals
(Table 1). The coverage for the individual autosomes
varied between 10x and 21x, while for the sex chromosomes
about half of the autosomal coverage was obtained
(Figure 1). In addition, large numbers of reads for
the Blond (260,270), Red (98,832) and Swallow-belly
(104,478) Mangalicas and the Duroc (100,663) individual
resulted in 1,571x, 602x, 638x and 615x coverage of the
pig reference mitochondrial genome [41], respectively. 
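
The read-pair geometry and mapping rates quoted above can be re-derived from the stated library parameters and the read counts in Table 1. The following is a minimal sketch, not the authors' code; the function and variable names are illustrative assumptions. With the counts as printed, the recomputed percentages agree with the reported ones to within about 0.1 percentage point.

```python
# Sketch: expected spacer between paired reads and mapping rate per animal.
# All names here are illustrative; the numbers are copied from the text and Table 1.

FRAGMENT_BP = 500  # average library fragment size
READ_BP = 100      # length of each read in a 2 x 100 bp pair


def expected_spacer(fragment_bp: int, read_bp: int) -> int:
    """Unsequenced gap between the two reads of a pair."""
    return fragment_bp - 2 * read_bp


# (total reads, mapped reads) for each sequenced animal, from Table 1
READS = {
    "BM": (626_951_708, 484_893_153),
    "RM": (414_579_434, 345_426_413),
    "SM": (405_954_574, 335_911_424),
    "D": (331_599_252, 273_390_375),
}


def mapped_percent(total: int, mapped: int) -> float:
    """Percentage of reads that mapped to the reference."""
    return 100.0 * mapped / total


if __name__ == "__main__":
    print("expected spacer:", expected_spacer(FRAGMENT_BP, READ_BP), "bp")
    for animal, (total, mapped) in READS.items():
        print(animal, f"{mapped_percent(total, mapped):.1f}%")
```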

Identification of genetic variants 

To identify SNP and INDEL variants we used the SAM- 
tools and GATK pipelines. In each animal, SAMtools 
and GATK provided a very similar number of SNPs and 
the proportion of the concordant variations was high. In 
contrast, GATK detected more INDELs than SAMtools, 
and thus the proportion of the common INDELs was 
lower compared either to the SNPs or to the total num- 
bers of INDELs identified by the pipelines (Additional 
file 2). We analysed only the concordant variants further. 
More than seven million SNP and INDEL variants were 
identified by comparing the genome of each Mangalica 
individual to the Sscrofa10.2 genome assembly. The
genome sequence of the Duroc male also contained al- 
most 6.5 million SNPs and INDELs when compared with 
the reference genome, which was assembled predominantly 
from a Duroc female animal [20]. SNPs outnumbered 
INDEL variations in all four animals by about 10-fold. In 
the Blond Mangalica, more homozygous than heterozygous
SNPs were identified; in the Red Mangalica their number 
was about the same, while in the Swallow-belly Mangalica 

Table 1 Sequencing statistics

                            BM a         RM a         SM a         D a
Total reads                 626951708    414579434    405954574    331599252
Mapped reads                484893153    345426413    335911424    273390375
Mapped reads (%)            77.3         83.3         82.8         82.5
Autosomal median coverage   19           14           14           11

a BM, Blond Mangalica; RM, Red Mangalica; SM, Swallow-belly Mangalica; D, Duroc.



there were more heterozygous than homozygous SNPs. 
In the Duroc animal, there were more heterozygous 
than homozygous SNPs. In each individual, more homozygous
than heterozygous INDELs were found, at a broadly
similar ratio in all four animals. SNP transitions were
more numerous than transversions in all four individuals
by about 2-fold. Summary statistics for
these data are shown in Table 2. 
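
The ratios described in this paragraph follow directly from the counts in Table 2. The sketch below recomputes them; it is only an illustration, and the dictionary layout and key names are assumptions made for this example.

```python
# Sketch: SNP/INDEL, heterozygous/homozygous and transition/transversion ratios.
# Counts are copied from Table 2 (BM, RM, SM = Mangalica breeds; D = Duroc).

TABLE2 = {
    "BM": {"snps": 6_944_767, "indels": 696_029, "het": 3_196_568,
           "hom": 3_744_651, "ti": 4_782_645, "tv": 2_165_670},
    "RM": {"snps": 6_871_283, "indels": 623_282, "het": 3_443_594,
           "hom": 3_424_501, "ti": 4_739_843, "tv": 2_134_628},
    "SM": {"snps": 6_734_038, "indels": 617_600, "het": 3_722_921,
           "hom": 3_008_107, "ti": 4_641_040, "tv": 2_096_008},
    "D": {"snps": 5_950_027, "indels": 451_299, "het": 3_314_982,
          "hom": 2_633_108, "ti": 4_097_573, "tv": 1_854_391},
}


def summarise(row: dict) -> dict:
    """Return the three ratios discussed in the text for one animal."""
    return {
        "snp_per_indel": row["snps"] / row["indels"],
        "het_per_hom": row["het"] / row["hom"],
        "ti_per_tv": row["ti"] / row["tv"],
    }


SUMMARY = {animal: summarise(row) for animal, row in TABLE2.items()}

if __name__ == "__main__":
    for animal, ratios in SUMMARY.items():
        print(animal, {name: round(value, 2) for name, value in ratios.items()})
```

A het/hom ratio below 1 corresponds to "more homozygous than heterozygous SNPs", and a value close to 1 to "about the same".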

Filtering the SNP variations using stringent criteria
(see Methods) resulted in 6.2 × 10^6, 6.3 × 10^6, 6.2 × 10^6
and 5.4 × 10^6 SNPs in the Blond, Red and Swallow-belly
Mangalica and the Duroc individuals, respectively
(Additional file 3). Approximately 9 to 13% of the filtered
SNPs were revealed as novel (Additional file 3)
when compared with the 28.6 million SNPs in the pig
dbSNP138 database. The filtered SNPs were grouped
into main and sub-categories according to their intergenic
or genic position and synonymous or non-synonymous
nature (Additional file 4). It was observed
that Mangalicas, in contrast to the Duroc animal, had 
more homozygous than heterozygous variations in 
almost all SNP categories. A comparison of both syn- 
onymous and non-synonymous exonic SNP variants 
revealed 12,448 SNPs that were common to the four 
animals, and approximately 5,200 to 9,500 unique 
SNPs for each individual (Figure 2). 
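
The shared and unique SNP counts behind such a Venn diagram amount to partitioning the union of per-animal SNP sets by membership. The sketch below shows one way to do this; the SNP keys are made-up toy values and the function name is an assumption, not part of the study's pipeline.

```python
# Sketch: partition SNPs by the exact set of animals that carry them.
from __future__ import annotations


def venn_partition(calls: dict[str, set]) -> dict[frozenset, set]:
    """Map each combination of animals to the SNPs found in exactly those animals."""
    regions: dict[frozenset, set] = {}
    for snp in set().union(*calls.values()):
        carriers = frozenset(name for name, snps in calls.items() if snp in snps)
        regions.setdefault(carriers, set()).add(snp)
    return regions


# Hypothetical toy data: (chromosome, position, alternative allele)
CALLS = {
    "BM": {("12", 100, "A"), ("12", 200, "T"), ("14", 50, "G")},
    "RM": {("12", 100, "A"), ("12", 200, "T")},
    "SM": {("12", 100, "A"), ("12", 200, "T"), ("1", 10, "C")},
    "D": {("12", 100, "A"), ("3", 77, "G")},
}

REGIONS = venn_partition(CALLS)

if __name__ == "__main__":
    for carriers, snps in REGIONS.items():
        print(sorted(carriers), len(snps))
```

The region keyed by the three Mangalica labels alone corresponds to SNPs common to all Mangalicas but absent from the Duroc.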



The detection of large INDELs was outside the scope of
the current study, and so only INDELs shorter than
52 bp were identified. For the genomes of the Blond,
Red, Swallow-belly Mangalicas and the Duroc pig, approximately
6.9 × 10^5, 6.2 × 10^5, 6.1 × 10^5 and 4.5 × 10^5
such INDELs were identified, respectively. Of these,
99.9% were novel compared to the dbSNP138 database.
With respect to the size distribution of the INDELs
among the four genomes, single base-pair INDELs were
the most abundant (Additional file 5). Exonic INDELs 
were sorted into eight categories: frame-shift deletions, 
frame-shift insertions, frame-shift block substitutions, 
non-frame-shift deletions, non-frame-shift insertions, 
non-frame-shift block substitutions, stop-gains and 
stop-losses (Additional file 6). In exonic INDELs, apart 
from the relatively large number of one base-pair variations 
that cause ORF shifts, +/- 3 base-pair changes, which do 
not affect the ORF, were identified in higher numbers than
two or four base-pair variations (Additional file 7). An ele- 
vated number of one base-pair INDELs when compared 
with other sizes has also been reported by others [42,43]. 
Our comparison with the platinum human exonic data
obtained from Illumina's BaseSpace
(https://basespace.illumina.com/datacentral) provided the same result
(data not shown), suggesting that our analysis of the
pig genome is reliable. 
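
The distinction between frame-shift and non-frame-shift exonic INDELs rests on whether the length change is a multiple of three. A minimal sketch of that rule follows; it covers only simple insertions and deletions (not block substitutions, stop-gains or stop-losses), and the function name is an assumption for illustration.

```python
# Sketch: classify a simple exonic INDEL by its effect on the reading frame.


def classify_indel(ref: str, alt: str) -> str:
    """Classify an insertion or deletion from its reference and alternative alleles."""
    delta = len(alt) - len(ref)
    if delta == 0:
        raise ValueError("alleles have equal length; not a simple INDEL")
    kind = "insertion" if delta > 0 else "deletion"
    # A length change divisible by three leaves the downstream frame intact.
    frame = "non-frame-shift" if delta % 3 == 0 else "frame-shift"
    return f"{frame} {kind}"


if __name__ == "__main__":
    for ref, alt in [("A", "AC"), ("A", "ACGT"), ("ACG", "A"), ("ACGT", "A")]:
        print(ref, alt, classify_indel(ref, alt))
```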

Table 2 Categories of sequence variations

                       BM a       RM a       SM a       D a
SNPs                   6944767    6871283    6734038    5950027
INDELs                 696029     623282     617600     451299
Total variants         7640796    7494565    7351638    6401326
Heterozygous SNPs      3196568    3443594    3722921    3314982
Homozygous SNPs        3744651    3424501    3008107    2633108
SNP transitions        4782645    4739843    4641040    4097573
SNP transversions      2165670    2134628    2096008    1854391
Multiple SNPs          3548       3188       3010       1937
Heterozygous INDELs    210860     200062     167240     144505
Homozygous INDELs      464812     406643     436692     298970

a BM, Blond Mangalica; RM, Red Mangalica; SM, Swallow-belly Mangalica;
D, Duroc.



Copy number variants (CNVs) were identified that 
were common amongst the three sequenced Mangalicas. 
Only CNV gains were analysed further due to the effect 
of sequence coverage depth on CNV losses [44]. One 
thousand and forty-one CNV gains with a copy number 
of three or more were identified across all chromosomes 
(Figure 3). The minimum and maximum size of the 
CNVs was 1,000 and 135,735 bp, respectively, with an
average of 3,529 bp. Of the 1,041 Mangalica CNVs, 485 
and 160 had no positional overlap with either the 3,118 
CNV gains described by Paudel and colleagues [44] or 
the 145,857 CNVs identified in the Duroc animal in this 
study, respectively, while the numbers of overlapping 
CNVs were 556 and 881, respectively. We note here that 
the very large number of CNVs in the Duroc animal is 
because no statistical test could be performed on data 
from one individual. Porcine genes could be annotated 
to 155 CNVs, while 886 CNVs did not contain any gene 



Figure 2 Venn diagram of exonic SNPs in the sequenced
animals. D, Duroc; BM, Blond Mangalica; SM, Swallow-belly
Mangalica; RM, Red Mangalica.



(Additional file 8). Of the 155 genes, 150 were unique since 
five genes contained two CNVs. An overrepresentation 
analysis identified 16 out of the 150 unique genes, 
which were in the overrepresented Molecular function 
(GO:0003674) category (P value = 1.25 × 10^-7). One of
the 16 genes, HOXB8, encoding a homeobox protein, is 
neither present in the literature [44] nor in the sequenced 
Duroc animal used in this study (Additional file 8). 
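
Splitting a CNV set into regions that do or do not positionally overlap a reference set, as done above for the 1,041 Mangalica CNV gains, can be sketched as follows. The coordinates are hypothetical, half-open intervals are an assumption of this example, and the quadratic scan is chosen for clarity rather than speed.

```python
# Sketch: separate query intervals by positional overlap with a reference set.
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals are on one chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def split_by_overlap(
    query: List[Interval], reference: List[Interval]
) -> Tuple[List[Interval], List[Interval]]:
    """Return (overlapping, non_overlapping) query intervals."""
    hit, miss = [], []
    for q in query:
        (hit if any(overlaps(q, r) for r in reference) else miss).append(q)
    return hit, miss


# Hypothetical toy coordinates
QUERY = [Interval("1", 1000, 3000), Interval("1", 10000, 11000), Interval("2", 500, 1500)]
REFERENCE = [Interval("1", 2500, 4000), Interval("2", 1500, 2500)]

HIT, MISS = split_by_overlap(QUERY, REFERENCE)

if __name__ == "__main__":
    print("overlapping:", len(HIT), "non-overlapping:", len(MISS))
```

For sets as large as the 145,857 Duroc CNVs, sorting intervals per chromosome and using a sweep or binary search would be the more practical choice.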

Analysis of genes with exonic, non-synonymous SNPs 
Functional, QTL and pathway annotation of the genes 

Due to the importance of the Mangalica x Duroc hybrids 
to the Hungarian pork industry, the 2,328 exonic, 
non-synonymous SNPs common to all three Mangalica
breeds but absent from the sequenced Duroc animal (Figure 4)
and the reference pig genome were selected for functional
analysis. These SNPs in the coding regions of genes, which 
result in amino acid changes in proteins, may be of great 
importance as they could be the polymorphisms affecting 
variation in phenotypes. The 2,328 SNPs were mapped to 
1,389 unique genes of the Sscrofa10.2 assembly as certain
genes had multiple SNPs (Additional file 9) and their anno- 
tation into biological process (BP) categories by the web- 
based software PANTHER [45] revealed that they belong to 
twelve major GO groups (Figure 5). Since the SNPs were 
identified by comparing Mangalicas, which are fatty-type
pigs, and Duroc, which is a lean-type breed, we were par- 
ticularly interested in those SNP-harbouring genes that 
might be involved in fat-related biological processes. 
Amongst the 1,389 unique genes with exonic, non- 
synonymous SNPs, we have identified 52 genes, which 
belonged to Lipid metabolic process (GO:0006629). Although
this category appeared in an overrepresentation analysis,
in contrast to the result obtained when two sets of 1,389
randomly chosen genes were used as controls, it was not overrepresented
after the strict Bonferroni correction (Additional file 10).
As another control, we have found no overrepresentation 
using the full pig gene set. Despite the lack of overrepresen- 
tation, we still consider that the identified genes might have 
a great importance, since the amino acid changes caused by 
the SNPs in them may affect the structure and, conse- 
quently, the function of the encoded protein, and such 
functional alterations of proteins remain hidden in gene expression
studies. The value of our SNP-based gene
identification approach is illustrated, for example, by the fact that
the proteins encoded by the PNLIP and PNLIPRP2 genes,
which have not previously been associated with fatness phenotypes in pigs,
are the targets of Orlistat (tetrahydrolipstatin), a drug
used for treating obesity in humans (data not shown). The
possible effect of exonic SNPs on protein function is dis- 
cussed below using FASN as an example. 
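
To illustrate why a category can appear in an overrepresentation analysis and still fail a Bonferroni correction, the sketch below computes an upper-tail probability and then multiplies it by the number of categories tested. The hypergeometric model is an assumption made for this example; the statistic actually used by PANTHER in the study may differ, and all numbers in the usage section are invented.

```python
# Sketch: overrepresentation P value with Bonferroni correction (hypergeometric assumed).
from math import comb


def hypergeom_sf(k: int, population: int, in_category: int, drawn: int) -> float:
    """P(X >= k) when `drawn` genes are sampled without replacement from
    `population` genes, `in_category` of which belong to the category."""
    upper = min(in_category, drawn)
    favourable = sum(
        comb(in_category, i) * comb(population - in_category, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, drawn)


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted P value, capped at 1."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    # Invented numbers: 100 genes, 10 in the category, 20 drawn, 5 hits, 200 categories tested
    raw = hypergeom_sf(5, 100, 10, 20)
    print("raw P:", raw, "corrected P:", bonferroni(raw, 200))
```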

To study the possible relationship between the 52 genes 
in the lipid metabolic process GO category and QTLs, the 
chromosomal position of each gene was compared to the

Figure 3 Distribution of CNVs across Mangalica chromosomes. Short vertical lines represent the position of CNVs, which are present in all
three sequenced Mangalicas.

Figure 4 Venn diagram of exonic, non-synonymous SNPs in
the sequenced animals. D, Duroc; BM, Blond Mangalica; SM,
Swallow-belly Mangalica; RM, Red Mangalica.



positions of the "Fatness" and "Fat composition" QTLs 
downloaded from the QTLdb, Release 19 [46]. Forty-nine
genes are in one or more fat-related QTLs, with 14 genes
on chromosome 14 overlapped by 15 fat-associated QTLs
(Additional file 11). Because of this large proportion (~28%)
of genes on chromosome 14, we performed an enrichment
analysis for the 14-gene set and a control set of 1,282 genes,
both of which are in the same region of chromosome 14 determined
by the 15 QTLs. The corrected P-value for lipid metabolic
genes in the control and in our set was 4.80 × 10^-3 and
2.95 × 10^-19, respectively, indicating that the enrichment of
the 14 genes in these QTLs deviates significantly from
random. 
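
Assigning genes to QTLs by chromosomal position, and then tallying QTL-associated genes per chromosome, can be sketched as below. Gene names, QTL names and coordinates are hypothetical placeholders, and closed intervals are an assumption of this example rather than a statement about the QTLdb format.

```python
# Sketch: list the QTLs overlapping each gene and count QTL-associated genes per chromosome.
from collections import Counter

# Hypothetical genes and QTLs: name -> (chromosome, start, end), closed intervals
GENES = {
    "geneA": ("14", 1000, 2000),
    "geneB": ("14", 9000, 9500),
    "geneC": ("12", 100, 900),
    "geneD": ("7", 10, 20),
}
QTLS = {
    "fatness_q1": ("14", 500, 5000),
    "fat_composition_q2": ("14", 1500, 10000),
    "fatness_q3": ("12", 0, 500),
}


def qtls_for_gene(gene: tuple, qtls: dict) -> list:
    """Names of all QTLs on the gene's chromosome that overlap its coordinates."""
    chrom, start, end = gene
    return [
        name
        for name, (q_chrom, q_start, q_end) in qtls.items()
        if q_chrom == chrom and start <= q_end and q_start <= end
    ]


ASSIGNMENTS = {name: qtls_for_gene(coords, QTLS) for name, coords in GENES.items()}
IN_QTL = [name for name, hits in ASSIGNMENTS.items() if hits]
PER_CHROMOSOME = Counter(GENES[name][0] for name in IN_QTL)

if __name__ == "__main__":
    print(ASSIGNMENTS)
    print(PER_CHROMOSOME)
```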

Fatty acid composition of meats is an important diet- 
etic and health issue for pork consumers. We, therefore, 
compared those genes, which are in saturated and unsat- 
urated fatty acid QTLs and found that nine genes were 
in common across both fatty acid categories, while the 
saturated and unsaturated QTL groups each contained 
two unique genes, NKX2-3 and EPHX2, and OMA1 and 
FAM135B, respectively (Additional file 12). 
Figure 5 Biological process ontology of genes with exonic SNPs found in Mangalica breeds. Of the 1,389 genes, 1,372 resulted in 2,130
total hits in processes. Percentage indicates the percent of genes in one process against the total number of process hits. 



Of the 52 lipid metabolic process-associated genes, we 
could map 41 to one or more pathways using the KEGG 
database. Almost 44% (18) of the mapped genes were 
associated with lipid metabolic pathways (Figure 6), 
while others contribute to glycan and carbohydrate metabolism,
biochemical processes at the interface of lipid
and other metabolic pathways and the regulation of 
lipid metabolism (Additional file 11). Of the 41 mapped 
genes, two are particularly important. One is FASN, 
which encodes an enzyme involved in a number of steps 
in the synthesis of 8 to 16 carbon-chain fatty acids in 
the fatty acids biosynthesis pathway [KEGG:ssc00061]. 
The FASN protein is a homodimeric multifunctional 
enzyme with six catalytic domains, which processes dif- 
ferent steps of cyclic elongation of fatty acids [47]. The 
other gene is SLC27A6, a member of a gene family, 
which is expressed in liver, heart and subcutaneous 
backfat of pig [48]. The encoded protein is a fatty acid 
transporter, which is one of the two membrane proteins 
of the PPAR signalling cascade [KEGG:hsa03320], which 
regulate lipid and fatty acid metabolism, bile acid bio- 
synthesis and adipocyte differentiation, amongst other 
regulated processes [49]. 

Genotyping SNPs in other breeds 

The 90 SNPs in the above described 52 genes were 
present in all three sequenced Mangalicas, but absent 
from the sequenced Duroc and the reference genome. To 



learn about their wider occurrence, we have "e-genotyped" 
55 animals whose genome was sequenced [9] for these 
SNPs. The results indicate that the frequencies of these 
SNPs vary amongst the 55 individuals (Additional file 13). 
Clustering of the average frequencies revealed four clus- 
ters among the individuals, where Mangalica represents a 
separate cluster and European, international/Hungarian 
Duroc, and non-European pigs and/or wild boars com- 
prise the three other related groups (Additional file 14). 
The clear separation of Mangalicas from other breeds by 
those 90 SNPs might have the potential in practical appli- 
cations, such as whole genome selection in breeding. 

It was found that four SNPs are present in the Mangalicas only,
and in none of the genotyped individuals (Additional
file 13). All of these SNPs are in one gene, MOGAT2
(ENSSSCG00000014861), which encodes a monoacyl- 
glycerol O-acyltransferase 2 enzyme, and is in several 
back- and belly-fat QTLs and in the "Fat digestion and 
absorption" (KEGG: 04975) pathway (Additional file 
11). It is possible, therefore, that this gene has a particular
role in the development of the fatty-pig phenotype of
Mangalicas.

Some studies have highlighted the importance of the 
FASN gene in pig fatness [50,51]. In this gene, we have 
identified two non-synonymous SNPs, which are present 
in the three sequenced Mangalicas, but not in the refer- 
ence genome and the sequenced Duroc individual used 
in this project. They are also different from those three 
Figure 6 Fat metabolic pathways and participating genes with Mangalica-specific exonic, non-synonymous SNPs. Lines represent the
interconnections of the pathways. Arrows indicate where signalling or metabolites (name above the line) affect genes in other pathways. 



SNPs that have been genotyped previously [50]. SNP1 is
in exon 9 (chromosome 12, position 1,028,766) and is a
G-C (reference) to A-T (Mangalica) transition, which
causes an R443Q amino acid change, while SNP2 is a C-G
(reference) to T-A (Mangalica) transition in exon 21
(chromosome 12, position 1,025,096) resulting in a T1088I
change in the FASN protein. The frequency of these two
SNPs is quite diverse in the genome-sequenced animals, including
the three Mangalicas and one Duroc individual sequenced
in this study (Additional file 15). We, therefore,
genotyped 72 Mangalica and 21 Duroc pigs for both SNPs
in order to get more information about these SNPs in the
two breeds. We found that the A ("Mangalica") alleles
(SNP1 A-T or SNP2 T-A) occur at a much higher frequency
than the B alleles in Mangalica, whereas in contrast the B
alleles (SNP1 G-C or SNP2 C-G, "non-Mangalica") are more
prevalent in Duroc (Table 3). Additionally, we found that
for SNP1, 62 and 10 Mangalicas and 1 and 20 Duroc animals
were AA and BB homozygous, respectively; no heterozygotes
were found. For SNP2, 65 Mangalicas and eight
Durocs had AA, five Durocs had BB, and seven Mangalicas
and eight Durocs had AB genotypes, respectively; no
Mangalica with BB genotype was found. 
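
Allele frequencies follow from genotype counts by simple allele counting: each homozygote contributes two copies of its allele and each heterozygote one copy of each. The sketch below applies this to the Mangalica genotype counts given above; the function name and data layout are assumptions for illustration.

```python
# Sketch: allele frequencies from genotype counts by allele counting.


def allele_frequencies(n_aa: int, n_ab: int, n_bb: int) -> tuple:
    """Return (frequency of A, frequency of B) for a biallelic SNP."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return freq_a, 1.0 - freq_a


# Mangalica genotype counts from the text, as (AA, AB, BB); 72 animals per SNP
MANGALICA = {
    "SNP1": (62, 0, 10),
    "SNP2": (65, 7, 0),
}

if __name__ == "__main__":
    for snp, counts in MANGALICA.items():
        freq_a, freq_b = allele_frequencies(*counts)
        print(snp, f"A={freq_a:.2f}", f"B={freq_b:.2f}")
```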

Discussion 

The genome of one individual each of the three Mangalica 
breeds (Blond, Red and Swallow-belly), and a Duroc animal 



from a Hungarian herd was sequenced and analysed. More 
than 100 million reads were obtained from the genome of 
each animal. On average for the four genomes sequenced, 
81% of the reads were mapped to the reference genome, 
resulting in 14.5x median autosomal coverage. Millions of
SNP and hundreds of thousands of INDEL variations were
identified in each of the three Mangalica genomes and in the Duroc genome
when compared to the reference pig genome
assembly Sscrofa10.2. By filtering the SNPs, about five
to six million variations were obtained, and about one- 
tenth of these were novel SNPs compared to the dbSNP138 
database (Additional file 3). 

For functional analysis, we selected 2,328 exonic non- 
synonymous SNPs present in each sequenced Mangalica 



Table 3 Genotyping the FASN gene

        Breed       Allele (Nucleotide)   Allele frequency
SNP1    Mangalica   A (A-T)               0.86
        Mangalica   B (G-C)               0.14
        Duroc       A (A-T)               0.05
        Duroc       B (G-C)               0.95
SNP2    Mangalica   A (T-A)               0.95
        Mangalica   B (C-G)               0.05
        Duroc       A (T-A)               0.05
        Duroc       B (C-G)               0.95

individual, but absent from either the reference genome
or the Hungarian Duroc animal. These SNPs were mapped
to 1,389 pig genes present in the Ensembl database. Since 
Mangalicas are fatty-type pigs, and the SNPs were identified 
in comparison with Duroc, a lean-type pig, we were par- 
ticularly interested in fat-related genes in this set. Fifty-two 
genes were found belonging to lipid-related metabolic 
process categories and were further analysed using QTL 
and pathway data-mining. Of the 52 genes, 49 and 41 are 
associated with fat-related QTL regions and KEGG path- 
ways, respectively (Additional file 11). 

Some of the 52 genes, for example ACACA, ANKRD23, 
GM2A, KIT, MOGAT2, MTTP, FASN, SGMS1, SLC27A6
and RETSAT, which we have highlighted here, have been 
previously described in the context of fat-related character- 
istics in pigs [50-54]. Of these genes, FASN, a gene encod- 
ing a fatty acid synthase, has been shown to be associated 
with a cis-11-eicosenoic acid (C20:1) percentage QTL in a
Guadyerbas x Landrace cross, although none of the identi- 
fied SNPs had any putative effect on the protein structure 
[50]. The FASN protein is a homodimeric, multifunctional 
enzyme with six catalytic domains, which are required for 
the cyclic elongation of fatty acids [47] and catalyses 32 re- 
actions in the fatty acid biosynthesis [KEGG:ssc00061] 
pathway. Targeted mutagenesis of the FASN gene and in- 
hibition of the FASN protein in mice resulted in reduced 
total body fat [55] and body weight [56], respectively. We 
have identified two SNPs in this gene in Mangalicas that re- 
sult in a R443Q (SNP1) and a T1088I (SNP2) amino acid 
change. The amino acid in position 443 is part of the α-helix
in the protein's inter-domain linker. Since glutamine
is more hydrophilic than arginine, the amino acid substitution
may affect the relative position of the two functional
domains by modulating the flexibility of the linker connecting
them [57]. The amino acid in position 1,088 is part of
the dehydratase domain of the FASN protein. This domain
catalyses the conversion of β-hydroxyacyl-ACP to β-enoyl-ACP
in the cyclic elongation of fatty acids [47]. T1088 is in
close vicinity to the active site of the dehydratase domain
containing an open-ended hydrophobic tunnel [57]. Predicting
the hydrophobicity of amino acids along the FASN
polypeptide revealed that the substituting I1088 is strongly
hydrophobic, while T1088 is hydrophilic (data not shown). It
is possible, therefore, that in the FASN T1088I protein the
substrate-binding nature of the active site is altered, which
may influence the dehydration step of the fatty acid 
cyclic elongation. This might be particularly important 
in Mangalicas, where no BB homozygotes were found. 
Thus the active site in the catalytic domain of their 
FASN protein is expected to be hydrophobic, although 
allele-specific expression of the FASN gene in hetero- 
zygotes might influence this. 
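
The effect of a threonine-to-isoleucine substitution on local hydrophobicity can be illustrated with a sliding-window hydropathy profile. The text does not name the scale used for the prediction, so the Kyte-Doolittle scale below is an assumption, and the peptide is a made-up toy sequence rather than the FASN sequence.

```python
# Sketch: sliding-window hydropathy before and after a T -> I substitution.
# Assumption: Kyte-Doolittle scale; the study does not state which scale was used.

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def substitute(seq: str, position: int, new_residue: str) -> str:
    """Replace the residue at a 1-based position."""
    i = position - 1
    return seq[:i] + new_residue + seq[i + 1:]


def window_hydropathy(seq: str, window: int = 5) -> list:
    """Mean hydropathy of every window of the given length along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window for i in range(len(seq) - window + 1)]


WILD_TYPE = "GASTLVKDE"                  # hypothetical peptide, T at position 4
MUTANT = substitute(WILD_TYPE, 4, "I")   # hypothetical T4I variant

if __name__ == "__main__":
    for name, seq in (("wild type", WILD_TYPE), ("mutant", MUTANT)):
        print(name, [round(x, 2) for x in window_hydropathy(seq)])
```

On this scale positive values are hydrophobic and negative values hydrophilic, so every window containing the substituted position shifts upward.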

It is known that feeding regimes influence fatty acid 
composition and meat marbling in Mangalicas [31,58],



similar to other pig breeds and farm animals. In lipid 
metabolism, the "Fat digestion and absorption" and "Bile 
secretion" pathways are involved in the metabolism of 
dietary fats. These two pathways are connected to the 
"Glycerolipid metabolism", "Fatty acid metabolism" and 
"Fatty acid biosynthesis" pathways. Our study highlighted a 
number of genes in these metabolic pathways and in the 
PPAR signalling pathway (Figure 6). We have identified 
one gene, MOGAT2 (ENSSSCG00000014861), with seven 
SNPs, of which four are present in Mangalicas, but 
not in the other 56 sequenced pig individuals (see Results).
The MOGAT2 protein catalyses the conversion of 
1-acylglycerol obtained from dietary fat into diacylglycerol
in the smooth endoplasmic reticulum of the
small intestinal epithelial cells, and thus participates in 
the production of chylomicron ("Fat digestion and 
absorption" pathway, KEGG:04975). Chylomicron af- 
fects the PPAR signalling pathway, which in turn 
regulates a number of lipid metabolic processes 
(Figure 6). It is possible, therefore, that polymorphisms
that affect genes in this complex network of
pathways, which are also part of relevant QTLs, may
be responsible for the differences in fattening, fat com- 
position and any related phenotypes that were ob- 
served between breeds in response to different feeding 
regimes. For example, the MOGAT2 gene was found to 
be part of the lipid concentration biological function, 
modulated in backfat [54]. 

Conclusions 

The discovery of genes behind agriculturally important 
traits is a difficult task in farm animals, in particular 
when the intermediate- or end-phenotypes are deter- 
mined by QTLs. In this study, we described the gen- 
ome sequencing and analysis of three Hungarian 
Mangalica individuals representing each of the three 
Mangalica breeds, which are local, fatty type pigs with 
a niche role in the pork market. After filtering, millions 
of SNPs were identified in each animal compared to the 
reference genome, and about 10% of them are novel 
compared to the porcine SNP entries of the dbSNP138 
database. This finding highlights that sequencing genomes 
of individuals of rare/local breeds can provide large 
amounts of data identifying genomic variations relative to 
the reference genome of the same species. These variations 
can be the basis for gene discoveries. With special emphasis 
on pig fatness, by annotating and comparing exonic, non- 
synonymous Mangalica-specific SNPs to QTLs and path- 
ways, we identified a number of candidate genes, which can 
serve for future genotyping, expression, structure-function, 
and biological network studies and in applications, such as 
molecular breeding and meat identification or tracing in 
both Mangalica and other breeds. 

Methods

Genome sequencing 

Pig blood samples were obtained from the MANGFOOD
consortium's Biobank at the Agricultural Biotechnology
Center, Gödöllő, Hungary. Total DNA was
extracted using the Duplica® Prep Automatic Extraction 
System and the Duplica® Blood DNA kit (EuroClone, 
Milan, Italy). DNA concentration was measured using 
the Quant-iT™ PicoGreen dsDNA® Assay (Life Technolo- 
gies, Budapest, Hungary). Preparation of 500 bp fragment 
libraries and 2 x 100 bp Illumina paired-end genome se- 
quencing was performed by Aros Applied Biotechnology 
(Aarhus, Denmark) as a custom service, using Illumina's
HiSeq2000 platform. 

Data analyses 

The Sus scrofa reference genome sequence 10.2 was 
indexed using the "bwtsw" algorithm option of BWA 
0.5.9rc1 [59] followed by mapping the short sequence
reads to the indexed genome using the default settings 
and the paired-end method of the same software. The 
obtained BAM files were sorted and indexed for further 
analyses. 

To detect small genetic variants (SNPs and INDELs), 
the SAMtools [60] and GATK (version: 2.3-9-ge5ebf34) 
[61] variant calling pipelines were employed. In SAM- 
tools, base-calling was performed using the "mpileup" 
command and the "-E -D -S -u" parameters of SAMtools 
0.1.18. The "view" command of BCFtools was used to 
call the variants using the "-bvcg" parameters. VCF files 
were then generated by the "vcfutils.pl" script using the 
"varFilter" option and SNPs and INDELs were extracted. 
Finally, SNPs with a Phred score higher than 30 (i.e.
a base-calling accuracy greater than 99.9%) and a
high-quality read coverage of at least three were retained
using a custom script. INDELs were used in downstream
analyses without filtering. For GATK, the dbSNP138 data 
were used as a training set. Other settings were used ac- 
cording to the GATK best practice online documentation. 
Results obtained by the two pipelines were compared using 
the BEDTools' [62] "intersectBed" module for SNPs and 
using our custom script for INDELs; only concordant
variations were processed further.
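
As an illustration of the filtering and concordance steps described above, a minimal Python sketch follows. It is not the custom script used in this study: the function names are hypothetical, and it assumes that the high-quality read coverage corresponds to the sum of the four DP4 counts reported in the INFO field of SAMtools VCF output.

```python
def phred_to_accuracy(q):
    """Convert a Phred-scaled quality into the implied probability of a correct call."""
    return 1.0 - 10 ** (-q / 10.0)


def passes_filter(vcf_line, min_qual=30.0, min_hq_depth=3):
    """Return True if a VCF record has QUAL > min_qual and enough high-quality reads.

    Assumption: high-quality coverage is taken as the sum of the DP4 counts
    (ref-forward, ref-reverse, alt-forward, alt-reverse).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = float(fields[5])
    # INFO is a ';'-separated list; flag entries without '=' are skipped.
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    hq_depth = sum(int(x) for x in info["DP4"].split(","))
    return qual > min_qual and hq_depth >= min_hq_depth


def concordant(calls_a, calls_b):
    """Keep only variants reported by both pipelines.

    Each call set holds (chromosome, position, ref, alt) tuples.
    """
    return calls_a & calls_b
```

Here a Phred score of 30 corresponds to an error probability of 10^-3, which is where the 99.9% accuracy threshold comes from.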

Copy number variations (CNVs) were detected as described
by Paudel and coworkers [44] using the mrCaNaVaR
(version 0.51) software [63]. The window size was
set to 1,000 bp. We selected windows where the copy
number and the standard deviation were greater than
three and 0.7, respectively, for the three Mangalicas.
The selected regions were then chained.
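
The window selection and chaining can be sketched as below. This is only an illustration under stated assumptions: the text does not specify how the standard deviation was computed or how windows were chained, so the sketch treats the standard deviation as a precomputed per-window value and chains windows that are directly adjacent on the same chromosome. The function names and the tuple layout are hypothetical.

```python
def select_windows(windows, min_cn=3.0, min_sd=0.7):
    """Keep windows whose copy number and standard deviation both exceed the thresholds.

    windows: iterable of (chrom, start, end, copy_number, sd) tuples,
    with half-open coordinates (start inclusive, end exclusive).
    """
    return [w for w in windows if w[3] > min_cn and w[4] > min_sd]


def chain_windows(selected, max_gap=0):
    """Merge selected windows on the same chromosome separated by at most max_gap bp."""
    chained = []
    for chrom, start, end, *_ in sorted(selected, key=lambda w: (w[0], w[1])):
        if chained and chained[-1][0] == chrom and start - chained[-1][2] <= max_gap:
            # Extend the current region instead of opening a new one.
            chained[-1][2] = max(chained[-1][2], end)
        else:
            chained.append([chrom, start, end])
    return [tuple(region) for region in chained]
```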

To determine novel variants in our sequence data, we 
compared the identified SNPs and INDELs with the 
dbSNP138 data using BEDTools [62] and annotated the de- 
tected genetic variants using ANNOVAR [64]. Following
the ANNOVAR analysis, non-synonymous exonic SNPs
that were present only in Mangalicas were determined
by BEDTools' "multiIntersectBed" module. Genes carrying
these variants were identified using a custom script.
Comparison of SNPs in the lipid metabolism genes amongst
genome-sequenced animals (this study and [44])
was also performed using the "multiIntersectBed" module
of BEDTools.
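
The logic of selecting SNPs present only in Mangalicas can be expressed with plain set operations, as in the following sketch. It is not the BEDTools workflow itself; the function name and data layout are hypothetical. The selection follows the criterion stated in this article: SNPs common to all three Mangalicas and absent from the sequenced Duroc. Absence from the reference genome is implicit because every SNP is called against it.

```python
def mangalica_specific(mangalica_calls, duroc_calls):
    """Return SNPs shared by every Mangalica and absent from the Duroc.

    mangalica_calls: dict mapping an animal label to a set of
    (chromosome, position, alt_allele) tuples.
    duroc_calls: set of the same kind of tuples for the Duroc.
    """
    shared = set.intersection(*mangalica_calls.values())
    return shared - duroc_calls
```

For example, with the labels BM, RM and SM used in the additional files, a SNP is reported only if it occurs in all three sets and not in the set for D.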

Gene ontology analysis was performed by the web- 
based software PANTHER [45]. For overrepresentation 
analyses, BioMart's [65] enrichment analysis option with
a P-value cut-off of 0.05 was employed using the Sscrofa10.2
reference genome as background. Random sets of genes
were generated by a custom Python script. Fat-related pig
QTLs and their positions were downloaded from the
QTLdb (Release 19) database [46], and their extent
was compared manually with the positions of the SNPs of
selected genes. Genes were annotated into pathways
using the KEGG database. 
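
A minimal sketch of how random control gene sets might be drawn, and of how a SNP position could be checked against QTL intervals programmatically, is given below. It is not the custom script used here, and the QTL comparison in this study was done manually. All names, the seed parameter and the tuple layout are hypothetical.

```python
import random


def random_gene_set(background, size, seed=None):
    """Draw a reproducible random set of gene IDs from the background gene list."""
    rng = random.Random(seed)
    # Sort first so that the draw does not depend on set iteration order.
    return sorted(rng.sample(sorted(background), size))


def overlapping_qtls(chrom, pos, qtls):
    """Return the names of QTLs whose interval contains the given SNP position.

    qtls: iterable of (name, chrom, start, end) with inclusive coordinates.
    """
    return [name for name, q_chrom, start, end in qtls
            if q_chrom == chrom and start <= pos <= end]
```

A control set of the same size as the gene list under study can then be passed through the same enrichment analysis.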

Data from Ensembl were retrieved using BioMart [65]; 
Venn diagrams were generated using the software Venny 
[66]; clustering was performed using CIMminer [67] 
with Manhattan distance and complete linkage cluster- 
ing settings. 
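
To make the clustering settings concrete, the following sketch shows Manhattan distance combined with complete linkage on presence/absence profiles. It is a naive illustration of the two settings and not CIMminer's implementation; the function names are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering with complete linkage.

    profiles: dict mapping a label to a numeric vector (e.g. SNP presence as 0/1).
    Returns a list of sets of labels.
    """
    clusters = [{label} for label in profiles]

    def linkage(c1, c2):
        # Complete linkage: cluster distance is the largest pairwise distance.
        return max(manhattan(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance and merge it.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```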

Genotyping 

To genotype the two Mangalica-specific SNPs in the FASN 
gene, High Resolution Melting (HRM) analysis was per- 
formed with a Rotor-Gene Q 5plex HRM Platform using a 
saturating dye (EvaGreen) technology (Qiagen, Hilden, 
Germany). PCR reactions were performed in 25 µl reaction
volumes using 60 ng total DNA as template and the Type- 
it HRM PCR kit (Qiagen, Hilden, Germany), according to 
the instructions of the manufacturer. The primers for FASN
SNP1 and SNP2 were FASN1_F: 5' CGCGATCTCGTTGAGCAT 3',
FASN1_R: 5' GTGCAGACCCTGCTGGAG 3' and
FASN2_F: 5' GGATAGGCTTGAGATGCTCTT 3',
FASN2_R: 5' GTGGTGGTGGACAGGAATCT 3', respectively.
Reactions were carried out with an initial
denaturation step at 95°C for 5 min, followed by 35 cycles of
95°C for 15 sec, 60°C for 30 sec and 72°C for 10 sec and 
then HRM curves were generated by acquiring fluorescence
data between 80 and 91°C. Individuals with homozygous 
and heterozygous genotypes were assigned according to 
their HRM curve determined by the Rotor-Gene software 
and visual inspection. 

Availability of supporting data 

The data sets supporting the results of this article are 
included within the article and its additional files.
Sequence data are deposited in the NCBI Sequence Read
Archive under identifier SRP039012.

Additional files



Additional file 1: Figure S1. Distribution of insert length in paired-end
sequencing. Figure showing the distribution of insert length and number 
of reads in four sequenced pig individuals. 

Additional file 2: Figure S2. Comparison of SNPs and INDELs detected 
by SAMtools and GATK. The numbers in the overlapping areas represent 
the absolute number of concordant variants, while coloured numbers 
represent the percentage of unique variants. BM, Blond Mangalica; RM, 
Red Mangalica; SM, Swallow-belly Mangalica; D, Duroc. 

Additional file 3: Tables S1 and S2. Number of filtered SNPs. Table S1.
The number of filtered SNPs in four sequenced pig individuals. Table S2.
The number of filtered SNPs in the four animals that are present in the
dbSNP138 database. BM, Blond Mangalica; RM, Red Mangalica; SM,
Swallow-belly Mangalica; D, Duroc. 

Additional file 4: Table S3. Annotation of SNPs. Table showing the 
annotation of SNPs into categories and the ratio of heterozygous- 
homozygous SNPs in each category. 

Additional file 5: Figure S3. Distribution of the size of INDELs. Figure 
showing the number of INDELs with respect to their size in four 
sequenced pig individuals. 

Additional file 6: Table S4. Annotation of INDELs. Table showing the 
number and percentage of INDELs annotated into categories. 

Additional file 7: Figure S4. Distribution of the size of exonic frame-shift 
INDELs. Figure showing the number of exonic frame-shift INDELs with 
respect to their size in four sequenced pig individuals. 

Additional file 8: Table S5. Copy number variations in Mangalicas. 
Table showing CNVs found in Mangalica pigs, their overlap with a Duroc 
animal and published data [44], and the overrepresented GO category. 

Additional file 9: Table S6. SNP variations in Mangalicas. Table 
showing SNPs identified in Mangalicas, which are not present in the 
dbSNP138 and the sequenced Duroc animal used in this study. Multiple 
SNPs of the same gene are shown in separate rows with gene ID in grey. 
Transcript and alternative transcript annotation of the SNP are shown in 
the "SNP Annotation" column. 

Additional file 10: Table S7. Overrepresentation analysis of genes with 
exonic non-synonymous SNPs in Mangalicas. Represented GO categories of 
1,389 genes, which carry SNPs identified in the three sequenced Mangalica 
individuals, and of two same size control pig gene sets. Overrepresentation 
can be considered statistically significant where the corrected P value is 
smaller than 5E-2. 

Additional file 11: Table S8. Selected genes with Mangalica-specific 
exonic non-synonymous SNPs. Table showing the genes with 
Mangalica-specific SNPs that belong to the GO:0006629 lipid 
metabolic process category, their annotation to pathways and the 
QTL regions that overlap them. 

Additional file 12: Table S9. Genes with Mangalica-specific SNPs in 
"Fatty acid percentage" QTLs. Table showing the name, Ensembl Gene ID, 
description and associated fatty acid percentage QTLs of genes with 
Mangalica-specific SNPs and belonging to the GO:0006629 lipid metabolic 
process category. 

Additional file 13: Table S10. Occurrence of SNPs in 59 genome 
sequenced pig individuals. Table showing the presence (1) or absence (0) 
of those exonic, non-synonymous SNPs, which were identified in all three 
genome-sequenced Mangalicas, but not in the sequenced Duroc and 
the reference genome, and which are lipid metabolic genes. The genome 
sequenced individuals [9] are shown by their identification number and 
breed/species name. 

Additional file 14: Figure S5. Heat-map of the frequency of 82 SNPs in 
genome sequenced pigs. Figure showing the clustering of 82 SNPs by 
their frequency in pig breeds/species. Four distinct clusters can be 
observed consisting of Mangalicas, European pigs/wild boars, Duroc of 
different origin and non-European pigs/wild boars/related species. The 
order of SNPs from top to bottom corresponds to that in Table S10
(Additional file 13). 



Additional file 15: Figure S6. SNP frequencies in the FASN gene. 
Figure showing the frequencies of two SNPs in 59 genome sequenced 
pigs, including the four in this study and 55 published by Groenen et al. 
[9]. The number of individuals is shown in brackets after each name.
Sus, three species of the genus (Sus cebifrons, Sus celebensis and
Sus verrucosus); Other, Bearded pig and warthog; WE, Western European;
Nature total, overall frequency of the published [9] 55 individuals.